It’s 2:17 AM on a Tuesday. Your phone rings. It’s the on-call SRE. You groan. You know what this means, and it’s not just that your sleep is ruined. It’s that you have to work. Something is wrong. Very wrong.
“The payment gateway is down. We’re throwing 500 errors on every checkout.”
(At least the enhanced monitoring tool you fought tooth and nail for has brought some value to this disaster)
While dialing into the bridge line, you make a pot of coffee, knowing it’s going to be a long morning? Day? Night?
For the next four hours it’s bedlam. The bridge line now has 30 people on it. VPs are woken up, join the line, and start barking orders. Your peers, your friends, people whose birthdays you celebrated, are dragged from their sleep.
Everyone stares at the monitoring dashboards. You have decided that red is NOT your favorite color. Logs are grepped. The decision is made: roll back the “simple fix” that was deployed eight hours ago. The fix that was only supposed to correct a formatting error in the confirmation email.
Sunrise comes (and one pot of coffee is gone), and the system is back online. The post-mortem is scheduled to start in 30 minutes. Just enough time to wash your face and use the facilities. The executive summary will read: a surprise regression in the Promotions module caused an unexpected interaction with the Payments service.
Another lie has been told.
We delude ourselves into thinking that we respond to surprises. We delude ourselves into believing that the law of unintended consequences catches up with everyone, that this is just part of life as an engineer.
There is no such thing as a “surprise.” There are only risks that we fail to make visible. This outage was not an accident. It WAS predictable. You were sitting on all the data necessary to predict it. You just didn’t look. Your Git repository holds the clues.
Let’s put on our detective hats, and walk through this post-mortem:
1. The Victim: The Payments service. A service that has been rock solid for months. Stable, reliable, consistent. Owned by a team of SMEs who know it inside and out. It hasn’t been touched in months (thanks, git blame).
2. The Weapon: A one-line change in the promotions service, a small bug fix. The weapon was used yesterday.
3. The Motive: Marketing wanted to increase engagement on promotional emails sent as part of the purchase process.
So what really happened? Was it a code review failure? Unit tests not properly written? A drive-by “LGTM” on a PR?
It was none of these. The code review ensured that coding standards were met. Unit tests passed, and they covered all branches. The PR comments WERE appropriate. The problem was not the code, and it was not the process; it was the COUPLING between the two services.
So why was it missed?
Had you treated your Git history as the database it can be, and not just a “remote backup,” you would have seen the warning signs a year ago. You would have discovered:
Clue #1: The Dangerous Correlation. Over the past year, 92% of all commits to Promotions/service.js have occurred within the same 24-hour period as a commit to Payments/api/v2/validator.js. These two files are not just friends; they are BFFs that go everywhere together. A change in one has almost always required a "defensive" change in the other. This isn't a service boundary; it's a fault line.
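This kind of co-change correlation is easy to compute once you have commit timestamps per file. Here is a minimal sketch; the commit data is hypothetical (in a real repo you would extract it with something like `git log --name-only --format='%H %aI'` and parse the output):

```python
from datetime import datetime, timedelta

# Hypothetical (timestamp, file) records extracted from `git log`.
commits = [
    (datetime(2024, 3, 1, 9, 0),   "Promotions/service.js"),
    (datetime(2024, 3, 1, 15, 30), "Payments/api/v2/validator.js"),
    (datetime(2024, 3, 8, 11, 0),  "Promotions/service.js"),
    (datetime(2024, 3, 9, 2, 0),   "Payments/api/v2/validator.js"),
    (datetime(2024, 4, 2, 10, 0),  "Promotions/service.js"),  # no partner commit
]

def cochange_rate(commits, file_a, file_b, window=timedelta(hours=24)):
    """Fraction of commits touching file_a that have a commit touching
    file_b within `window` (before or after)."""
    a_times = [t for t, f in commits if f == file_a]
    b_times = [t for t, f in commits if f == file_b]
    if not a_times:
        return 0.0
    coupled = sum(1 for ta in a_times
                  if any(abs(ta - tb) <= window for tb in b_times))
    return coupled / len(a_times)

rate = cochange_rate(commits, "Promotions/service.js",
                     "Payments/api/v2/validator.js")
print(f"{rate:.0%} of Promotions commits were coupled to Payments")
```

A sustained rate anywhere near the 92% in this story is a loud signal that the “boundary” between two services is a fiction.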
Clue #2: The Knowledge Silo. The Payments service has one SME. “Brad,” the lone engineer, has been responsible for over 85% of all commits. The bad news: “Brad” has been on vacation for the past two weeks, off-grid. The other 30 people on the call knew OF the Payments service, but not how it works: the nuances, the dependencies. No one else touches it; “Brad” always gets those stories during sprint planning; it “just is.” As such, no one knew why the Payments service WAS failing, and they all brought their miner’s hats to the call.
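Ownership concentration is the simplest of these metrics to compute. A sketch, using a hypothetical author list (in practice, the output of something like `git log --format='%an' -- Payments/`):

```python
from collections import Counter

# Hypothetical commit authors for the Payments service.
authors = ["Brad"] * 17 + ["Dana", "Dana", "Lee"]

def ownership(authors):
    """Return (top_author, share_of_commits) for a file or directory."""
    counts = Counter(authors)
    top, n = counts.most_common(1)[0]
    return top, n / len(authors)

top, share = ownership(authors)
print(f"{top} owns {share:.0%} of commits")  # prints: Brad owns 85% of commits
```

When one name owns 85% of a service, that share is not a badge of expertise; it is a bus factor of one, and the clock on it starts the day that person goes off-grid.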
Clue #3: The Churn Hotspot. You know the Promotions service is in constant flux. You know marketing likes to “test” different approaches and measure engagement. (Hell, you wrote the conversion tracking code that feeds their weekly dashboards.) You also know that many people end up picking up those stories, so knowledge of the Promotions service is a swamp: a large number of engineers, not a lot of depth.
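The churn hotspot combines the two signals above: many recent commits, no dominant owner. A minimal sketch over hypothetical (file, author) records (the thresholds here are illustrative, not canonical):

```python
from collections import Counter, defaultdict

# Hypothetical (file, author) pairs from recent history, e.g. parsed from
# `git log --since='90 days ago' --format='%an' --name-only`.
changes = (
    [("Promotions/service.js", a) for a in
     ["Ana", "Ben", "Cy", "Dee", "Ana", "Eli", "Fay", "Ben", "Gil", "Hal"]] +
    [("Payments/api/v2/validator.js", "Brad")] * 2
)

def hotspots(changes, min_churn=5, max_top_share=0.5):
    """Flag files with many recent commits and no dominant owner."""
    by_file = defaultdict(list)
    for path, author in changes:
        by_file[path].append(author)
    flagged = []
    for path, authors in by_file.items():
        churn = len(authors)
        top_share = Counter(authors).most_common(1)[0][1] / churn
        if churn >= min_churn and top_share <= max_top_share:
            flagged.append((path, churn, top_share))
    return flagged

print(hotspots(changes))  # Promotions: 10 commits, top "owner" at only 20%
```

In this toy data, Promotions gets flagged (high churn, shallow ownership) while Payments does not; combine that flag with the co-change correlation from Clue #1 and the outage stops looking like bad luck.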
The Results:
The outage really WASN’T a surprise. You had the data. Murphy caught up with you. It was bound to happen. You sent a high-churn, low-ownership service, at full speed, through an intersection where the stoplights were disabled. And it happened: it T-boned the one service where the only person who knew what was REALLY happening was on vacation, off-grid.
This wasn’t a surprise. Stop calling it that. Start calling it what it is: unvisualized, unrecognized risk.
Your Git log is an immutable, cryptographically verifiable record of every near-miss and hidden dependency in your system. Every day it collects more data, waiting for you to mine it.
Are you willing to do the forensic work so you can listen? What is it worth to you? How much sleep?